library(rgugik)
library(sf)Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
substring(commune_names$TERYT, 1, 2) -> vpom
commune_names |> subset(subset = vpom == "22") -> pom0As was mentioned above in section Section 2.1, spatial vector geometries are most often furnished with identification keys, permitting other data using the same key or index identifiers to be added correctly. In Poland, they have been known as "TERYT", and are fairly stable over time, though border changes can occur, with new key values appearing and old values ceasing to be used, leading to breaks of series. The rgugik package contains a data object listing these keys together with unit names, but matching on names can be troublesome. First we will subset this object to the Pomeranian voivodeship:
library(rgugik)
library(sf)Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
substring(commune_names$TERYT, 1, 2) -> vpom
commune_names |> subset(subset = vpom == "22") -> pom0then retrieve the boundaries of municipalities in Pomerania using "TERYT" rather than using the names of municipalities, which are duplicated in other voivodeships:
borders_get(TERYT = pom0$TERYT) -> pomMerging the pom object containing the "TERYT" keys and municipality names, with pom_sf, including the "TERYT" keys and municipality boundaries is easy, as the keys are both character objects - this matters because some keys may start with 0, zero, and be wrongly read as integers, dropping leading zeros. The key for this voivodeship is "22", so this problem is not encountered here, but often occurs. Should an identifying key with leading zeros be read as integer, it can be restored using base::formatC after checking the required string width:
ID <- as.integer("0012345")
str(ID) int 12345
formatC(ID, format="d", width=7, flag="0")[1] "0012345"
Note that the "TERYT" key values returned in the pom_sf object include an underscore between the six digit territorial code and the trailing digit expressing the type of municipality.
dim(pom)[1] 123 2
pom |> sapply(class) |> str()List of 2
$ TERYT: chr "character"
$ geom : chr [1:2] "sfc_MULTIPOLYGON" "sfc"
str(pom$TERYT) chr [1:123] "226401_1" "221101_1" "221106_2" "220605_2" "221207_2" ...
str(pom0)tibble [123 × 2] (S3: tbl_df/tbl/data.frame)
$ NAME : chr [1:123] "Sopot" "Hel" "Krokowa" "Liniewo" ...
$ TERYT: chr [1:123] "2264011" "2211011" "2211062" "2206052" ...
The underscore needs to be removed before finally checking that they match:
pom$TERYT <- sub("_", "", pom$TERYT)
any(is.na(match(pom$TERYT, pom0$TERYT)))[1] FALSE
Finally, the merging operation may be carried out:
pom |> merge(pom0, by = "TERYT") -> pom1
pom1Simple feature collection with 123 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 350809.9 ymin: 626314.8 xmax: 542048.1 ymax: 775019.1
Projected CRS: ETRF2000-PL / CS92
First 10 features:
TERYT NAME geometry
1 2201012 Borzytuchom MULTIPOLYGON (((392568.1 69...
2 2201023 Bytów MULTIPOLYGON (((399745.2 68...
3 2201032 Czarna Dąbrówka MULTIPOLYGON (((415365.5 72...
4 2201042 Kołczygłowy MULTIPOLYGON (((382369 6921...
5 2201052 Lipnica MULTIPOLYGON (((392500.3 67...
6 2201063 Miastko MULTIPOLYGON (((366331.6 67...
7 2201072 Parchowo MULTIPOLYGON (((414238.3 69...
8 2201082 Studzienice MULTIPOLYGON (((405862.1 68...
9 2201092 Trzebielino MULTIPOLYGON (((376562.5 69...
10 2201102 Tuchomie MULTIPOLYGON (((384623.5 68...
Having methodically made a first merge, we can move forward with three files at the municipality level from Statistics Poland’s local data bank. These were exported as comma (or semicolon) separated value (CSV) files on February 15, 2024 - dates matter, as online data banks update their contents as inaccuracies are detected. Currently, no package provides direct access to this data bank, and for moderate data volumes, registration is required. The first file is semicolon separated, and contains population data by age group and sex (category K3, group G7, subgroup P2577) for the end of the second half-year 2022:
pop22 <- read.csv("Datasets/sd/LUDN_2577_2022.csv", sep=";")As with most such data, the column names need adjustment. On reading, reserved characters in R are replaced by dots, and for further work, shorter column names are helpful. Current R versions support multibyte characters across all platforms, but some files with single-byte characters can be encountered, in which case judicious use of base::iconv may be needed.
sapply(pop22, class) Kod
"integer"
Nazwa
"character"
w.wieku.przedprodukcyjnym...14.lat.i.mniej.mężczyźni.2022..osoba.
"integer"
w.wieku.przedprodukcyjnym...14.lat.i.mniej.kobiety.2022..osoba.
"integer"
w.wieku.produkcyjnym..15.59.lat.kobiety..15.64.lata.mężczyźni.mężczyźni.2022..osoba.
"integer"
w.wieku.produkcyjnym..15.59.lat.kobiety..15.64.lata.mężczyźni.kobiety.2022..osoba.
"integer"
w.wieku.poprodukcyjnym.mężczyźni.2022..osoba.
"integer"
w.wieku.poprodukcyjnym.kobiety.2022..osoba.
"integer"
X
"logical"
pop22$Kod <- as.character(pop22$Kod)
names(pop22)[3:8] <- c("m_u15", "f_u15",
"m_15_64", "f_15_59", "m_o65", "k_o60")
pop22$type <- factor(substring(pop22$Kod, 7, 7),
levels = c("1", "2", "3"),
labels = c("urban", "rural", "urban_rural"))
names(pop22) [1] "Kod" "Nazwa" "m_u15" "f_u15" "m_15_64" "f_15_59" "m_o65"
[8] "k_o60" "X" "type"
The key column was read as integer, so needs correcting; otherwise the remaining columns seem to be formatted acceptably. A new column is created as a factor showing the type of municipality, as a factor In addition, we The "X" column contains no data, and is dropped from the merging operation. We could overwrite the cumulating object on merge; here we choose to create a new object at each merge. The key is called "TERYT" in pom_sf and "Kod" in pop92, specified with by.x and by.y arguments:
pom1 |> merge(pop22[, -9], by.x = "TERYT",
by.y = "Kod") -> pom2
pom2Simple feature collection with 123 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 350809.9 ymin: 626314.8 xmax: 542048.1 ymax: 775019.1
Projected CRS: ETRF2000-PL / CS92
First 10 features:
TERYT NAME Nazwa m_u15 f_u15 m_15_64 f_15_59
1 2201012 Borzytuchom Borzytuchom (2) 406 387 1181 1039
2 2201023 Bytów Bytów (3) 2306 2162 8236 7373
3 2201032 Czarna Dąbrówka Czarna Dąbrówka (2) 610 596 1859 1556
4 2201042 Kołczygłowy Kołczygłowy (2) 387 308 1425 1203
5 2201052 Lipnica Lipnica (2) 531 449 1702 1463
6 2201063 Miastko Miastko (3) 1424 1331 6058 5021
7 2201072 Parchowo Parchowo (2) 416 382 1237 1106
8 2201082 Studzienice Studzienice (2) 346 352 1206 1066
9 2201092 Trzebielino Trzebielino (2) 350 299 1213 998
10 2201102 Tuchomie Tuchomie (2) 400 366 1462 1265
m_o65 k_o60 type geometry
1 172 301 rural MULTIPOLYGON (((392568.1 69...
2 1794 3250 urban_rural MULTIPOLYGON (((399745.2 68...
3 406 657 rural MULTIPOLYGON (((415365.5 72...
4 293 464 rural MULTIPOLYGON (((382369 6921...
5 389 611 rural MULTIPOLYGON (((392500.3 67...
6 1430 2773 urban_rural MULTIPOLYGON (((366331.6 67...
7 243 414 rural MULTIPOLYGON (((414238.3 69...
8 249 406 rural MULTIPOLYGON (((405862.1 68...
9 250 446 rural MULTIPOLYGON (((376562.5 69...
10 270 420 rural MULTIPOLYGON (((384623.5 68...
Having added the population group counts by sex for the Pomeranian municipalities, we create a new column "pop" summing the total population at year end 2022. We then calculate municipality areas, and divide population by area (in square kilometres) to get the population density:
pom2 |> st_drop_geometry() |> subset(select = 4:9) |>
apply(1, function(x) sum(x)) -> pom2$pop
pom2 |> st_area() |>
units::set_units("km2") -> pom2$area_km2
pom2 |> st_drop_geometry() |>
subset(select = c(pop, area_km2)) |>
apply(1, function(x) x[1]/x[2]) -> pom2$density
pom2Simple feature collection with 123 features and 13 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 350809.9 ymin: 626314.8 xmax: 542048.1 ymax: 775019.1
Projected CRS: ETRF2000-PL / CS92
First 10 features:
TERYT NAME Nazwa m_u15 f_u15 m_15_64 f_15_59
1 2201012 Borzytuchom Borzytuchom (2) 406 387 1181 1039
2 2201023 Bytów Bytów (3) 2306 2162 8236 7373
3 2201032 Czarna Dąbrówka Czarna Dąbrówka (2) 610 596 1859 1556
4 2201042 Kołczygłowy Kołczygłowy (2) 387 308 1425 1203
5 2201052 Lipnica Lipnica (2) 531 449 1702 1463
6 2201063 Miastko Miastko (3) 1424 1331 6058 5021
7 2201072 Parchowo Parchowo (2) 416 382 1237 1106
8 2201082 Studzienice Studzienice (2) 346 352 1206 1066
9 2201092 Trzebielino Trzebielino (2) 350 299 1213 998
10 2201102 Tuchomie Tuchomie (2) 400 366 1462 1265
m_o65 k_o60 type geometry pop area_km2
1 172 301 rural MULTIPOLYGON (((392568.1 69... 3486 108.5179 [km^2]
2 1794 3250 urban_rural MULTIPOLYGON (((399745.2 68... 25121 196.6898 [km^2]
3 406 657 rural MULTIPOLYGON (((415365.5 72... 5684 297.6368 [km^2]
4 293 464 rural MULTIPOLYGON (((382369 6921... 4080 169.9070 [km^2]
5 389 611 rural MULTIPOLYGON (((392500.3 67... 5145 308.6255 [km^2]
6 1430 2773 urban_rural MULTIPOLYGON (((366331.6 67... 18037 466.1825 [km^2]
7 243 414 rural MULTIPOLYGON (((414238.3 69... 3798 131.0944 [km^2]
8 249 406 rural MULTIPOLYGON (((405862.1 68... 3625 176.0147 [km^2]
9 250 446 rural MULTIPOLYGON (((376562.5 69... 3556 225.0631 [km^2]
10 270 420 rural MULTIPOLYGON (((384623.5 68... 4183 110.0614 [km^2]
density
1 32.12374
2 127.71885
3 19.09710
4 24.01314
5 16.67069
6 38.69085
7 28.97149
8 20.59487
9 15.80001
10 38.00605
The next file contains counts of farms from the Agricultural census reporting farm income from the farm (category K34, group G637, subgroup P4139). While this file was exported from the local data bank in the same way as the previous one, it is comma-separated:
ag20 <- read.csv("Datasets/sd/POWS_4139_2020.csv", sep=",")Again, column names require simplification, but here there is no spurious "X" column - we just drop the municipality name from the merge:
ag20$Kod <- as.character(ag20$Kod)
names(ag20)[3:7] <- c("farm", "non_farm_business",
"non_farm_wages", "pension", "other_non_wage")
pom2 |> merge(ag20[,-2], by.x = "TERYT",
by.y = "Kod") -> pom3Finally, and because the agricultural census data are counts of farms, we need counts of inhabited dwellings from the population census 2021 (category K31, group G645, subgroup P4383), especially for urban municipalities; the merging process follows the lines of those above:
cen21 <- read.csv("Datasets/sd/NARO_4383_2021.csv", sep=";")cen21$Kod <- as.character(cen21$Kod)
names(cen21)[3] <- "dwellings"
pom3 |> merge(cen21[, -c(2, 4)], by.x = "TERYT",
by.y = "Kod") -> pom4
names(pom4) [1] "TERYT" "NAME" "Nazwa"
[4] "m_u15" "f_u15" "m_15_64"
[7] "f_15_59" "m_o65" "k_o60"
[10] "type" "pop" "area_km2"
[13] "density" "farm" "non_farm_business"
[16] "non_farm_wages" "pension" "other_non_wage"
[19] "dwellings" "geometry"
In conclusion, for completeness, let us assign aggregation levels to the newly merged values:
agrs <- factor(c(rep("identity", 3), rep("aggregate", 6),
"identity", rep("aggregate", 9)),
levels=c("constant", "aggregate", "identity"))
names(agrs) <- names(st_drop_geometry(pom4))
pom4 |> st_set_agr(agrs) -> pom5For many purposes, maps themselves are spatial data. Observations are located in space as points, lines or polygons, and their placing yields useful contextual information when plotted. Using the plot methods on terra and sf, we can display the elevation raster as a background, and show the municipality boundaries with their ID keys:
library(elevatr)elevatr v0.99.0 NOTE: Version 0.99.0 of 'elevatr' uses 'sf' and 'terra'. Use
of the 'sp', 'raster', and underlying 'rgdal' packages by 'elevatr' is being
deprecated; however, get_elev_raster continues to return a RasterLayer. This
will be dropped in future versions, so please plan accordingly.
library(terra)terra 1.7.71
maule_sf |> get_elev_raster(z = 7,
clip = "locations", neg_to_na = TRUE) |>
rast() -> maule_elevMosaicing & Projecting
Clipping DEM to locations
Note: Elevation units are in meters.
plot(maule_elev, col=rev(hcl.colors(50, "Dark Mint")))
plot(st_geometry(maule_sf), add=TRUE)
text(st_coordinates(st_centroid(st_geometry(maule_sf))),
label=maule_sf$codigo_comuna)
mapsf provides fairly flexible map creation functionality for overlaying layers of map information on a graphics output device, following the same order as above, but permitting the improvement of label positioning. Above, we set the palette to be used for elevation to "Dark Mint", which is the default for mapsf::mf_raster:
library(mapsf)
mf_raster(maule_elev, leg_pos="bottomleft",
leg_title="elevation (m)", leg_horiz=TRUE,
leg_size=0.8, leg_val_rnd=0)
mf_map(st_geometry(maule_sf), add=TRUE, lwd=2,
border="grey60", col="transparent")
mf_label(maule_sf, var="codigo_comuna", overlap=FALSE,
halo=2)
tmap is about to be upgraded, and is very concise in expressing map construction by adding successive layers to the output graphic with the + operator:
library(tmap)Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
remotes::install_github('r-tmap/tmap')
tm_shape(maule_elev) + tm_raster(n=15,
palette=rev(hcl.colors(7, "Dark Mint"))) +
tm_shape(maule_sf) + tm_borders() +
tm_text("codigo_comuna")
Using the simplest approach, we can plot the positions of the take-away outlets registered for Talca city in OpenStreetMap, and see that they are very clustered within the boundary of the municipality:
plot(st_geometry(talca_sf))
plot(st_geometry(talca_takeaways), pch=4, col=3,
cex=2, lwd=2, add=TRUE)
Since it would be very useful to obtain more locational context, we can use mapview to view the take-away outlets on a web map background, so that we can interact with the point locations. The mapview methods convert the object to be displayed to "OGC:WGS84" if necessary, then to Web Mercator for display:
library(mapview)
mapview(talca_takeaways)Interactive map of take-away outlets in Talca
Moving to the data set for municipalities in the Polish voivodeship of Pomerania, we can attempt to show the density variable on an interactive map:
library(mapview)
mapview(pom5, zcol="density")Interactive map of Pomeramia municipalities from rgugik, population density per square kilometre
As may be seen in the interactive map, the municipality boundaries do match the Baltic sea coastline quite well, but the jurisdiction of the municipalities extends to some coastal waters in the Gulf of Gdańsk and Zalew Wiślany; we did after all download administrative municipality boundaries using rgugik. We need to find another source of boundaries that agree with the coastline, and then need to take the intersection of the two sets od polygons, leaving only areas present in both. giscoR provides access to boundaries for European NUTS regions, in geographical coordinates by default. We then need to transform to the Transverse Mercator projection used for the download from rgugik before carrying out the intersection:
library(giscoR)
pom_gisco <- gisco_get_nuts(year="2021", resolution="01",
spatialtype="RG", nuts_id="PL63")
pom_gisco_tm <- st_transform(pom_gisco, "EPSG:2180")
pom6 <- st_intersection(pom5, pom_gisco_tm)Warning: attribute variables are assumed to be spatially constant throughout
all geometries
On occasion, topological operations on two objects like st_intersection fail because the two geometries have different coordinate reference systems, in which case, st_transform should be used first to bring them into agreement. If in any topological operation or predicate (such as a query like st_intersects), it is found that a geometry is invalid, for example the self-intersection of a polygon boundary, use st_make_valid on the invalid object. It can be the case that the geometry cannot be repaired, as some data providers do not check the provided geometries for validity. Since an intersection operation can yield points and lines as well as polygons, we can check the input and output geometries and re-establish their original geometry types; in this case only polygons and multipolygons were output:
pom5 |> st_geometry_type() |> table() -> tab5; tab5[tab5>0]MULTIPOLYGON
123
pom6 |> st_geometry_type() |> table() -> tab6; tab6[tab6>0]
POLYGON MULTIPOLYGON
119 4
pom6 |> st_cast("MULTIPOLYGON") -> pom6Since the areas of some municipalities have now changed, we need to re-caluclate both areas and densities:
pom6 |> st_area() |>
units::set_units("km2") -> pom6$area_km2
pom6 |> st_drop_geometry() |>
subset(select = c(pop, area_km2)) |>
apply(1, function(x) x[1]/x[2]) -> pom6$densityAs can be seen from the figure below, the graphical representation is now more legible; coastlines and boundaries are typically read as contextual location information.
mapview(pom6, zcol="density")Interactive map of Pomeramia municipalities from rgugik, corrected population density per square kilometre
Two minor points before we move on, first that the distribution of the variable which we want to map does matter. The population density of Pomeranian municipalities is far from being symmetric, as this density plot shows:
plot(density(pom6$density))
In this case, in thematic cartography we should create class intervals for display using quantiles or other suitable methods. mapsf provides a geometric progression "geom" method for creating class intervals for skewed variables:
mf_map(pom6, var="density", type="choro", breaks="geom",
nbreaks=7)
The choice of breaks style does matter for communicating the important traits of the variable in question in thematic cartography, here as a choropleth map.
The second display problem is associated with showing the kind of classes implied by the data. For a categorical variable and few classes, the classes are taken from the categories, and unlike the examples above using sequential palettes, this uses a qualitative palette:
mf_map(pom6, var="type", type="typo")
When the variable is crosses a specified mid-point, like regression residuals, a sequential palette is not appropriate, and a diverging palette should be chosen. tmap::tm_fill attempts to make reasonable choices in this situation; here the deviance of the proportion of the municipality population less than 16 years old from its mean:
pom6$young <- 100*(pom6$m_u15 + pom6$f_u15)/pom6$pop
pom6$young_res <- residuals(lm(young ~ 1, data=pom6))tm_shape(pom6) + tm_fill("young_res", style="quantile",
n=11, midpoint=0, palette="RdBu") + tm_borders()
For the present, we will only consider the reading and writing of spatial vector data using functions from sf: st_read and st_write. Further, while it is still regretably the case that many organisations still use the "ESRI Shapefile" (actually most often a bundle of at least four files) for sharing data, the GeoPackage format, "GPKG", should be preferred because it represents CRS properly, does not restrict field (variable) names to 10 characters, does use multibyte characters, and does store numerical fields in binary rather than text format. For commonly used formats, the format-specifiic driver is chosen from the file extension:
fn <- paste0(io_dir, "/pom6.gpkg")
if (file.exists(fn)) tull <- file.remove(fn)
try(st_write(pom6, fn))Writing layer `pom6' to data source `Input_output/pom6.gpkg' using driver `GPKG'
Creating field FID failed.
Error in eval(expr, envir, enclos) : Layer creation failed.
Unfortunately, our intersection of the GISCO voivodeship outline added a number of variables to the data set, including one called "FID" , which is a reserved word for many spatial data formats, meaning “Feature IDentifier”, and must be an integer. Now it is a character vector, so we replace it by a compliant integer vector with unique values; we could also have dropped it:
str(pom6$FID) chr [1:123] "PL63" "PL63" "PL63" "PL63" "PL63" "PL63" "PL63" "PL63" "PL63" ...
pom6$FID <- 1:nrow(pom6)
st_write(pom6, dsn=fn)Writing layer `pom6' to data source `Input_output/pom6.gpkg' using driver `GPKG'
Writing 123 features with 31 fields and geometry type Multi Polygon.
To read, st_read will try to identify the format and use the appropriate driver automatically; if the data source containd multiple layers, the first will be read by default:
pom6a <- st_read(dsn=fn)Reading layer `pom6' from data source
`/home/rsb/und/PG_AGII_2sem/Input_output/pom6.gpkg' using driver `GPKG'
Simple feature collection with 123 features and 30 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 350923.1 ymin: 626314.8 xmax: 542027.9 ymax: 774909.2
Projected CRS: ETRF2000-PL / CS92
When we do a simple check to see if output and input are identical - and find that they are not:
isTRUE(all.equal(st_geometry(pom6), st_geometry(pom6a)))[1] FALSE
When written to the external format, the order of the columns changes, as the geometry column is moved to the end on reading. This re-orders the columns, which contain the same information as before but the order differs:
names(pom6) [1] "TERYT" "NAME" "Nazwa"
[4] "m_u15" "f_u15" "m_15_64"
[7] "f_15_59" "m_o65" "k_o60"
[10] "type" "pop" "area_km2"
[13] "density" "farm" "non_farm_business"
[16] "non_farm_wages" "pension" "other_non_wage"
[19] "dwellings" "COAST_TYPE" "FID"
[22] "MOUNT_TYPE" "NAME_LATN" "CNTR_CODE"
[25] "NUTS_ID" "LEVL_CODE" "URBN_TYPE"
[28] "NUTS_NAME" "geo" "geometry"
[31] "young" "young_res"
names(pom6a) [1] "TERYT" "NAME" "Nazwa"
[4] "m_u15" "f_u15" "m_15_64"
[7] "f_15_59" "m_o65" "k_o60"
[10] "type" "pop" "area_km2"
[13] "density" "farm" "non_farm_business"
[16] "non_farm_wages" "pension" "other_non_wage"
[19] "dwellings" "COAST_TYPE" "MOUNT_TYPE"
[22] "NAME_LATN" "CNTR_CODE" "NUTS_ID"
[25] "LEVL_CODE" "URBN_TYPE" "NUTS_NAME"
[28] "geo" "young" "young_res"
[31] "geom"
In addition, the "area_km2" column loses its units definition, and the "type" column becomes a character vector, not a factor. Such differences are trivial and to be expected, but imply that if an object is to shared with collaborators, any shared script should use the object as read from file, especially if column order rather than column names are used in analysis. After reading, we can restore the representation of the columns before starting work:
pom6a$type <- factor(pom6a$type)
pom6a$area_km2 <- units::set_units(pom6a$area_km2, "km2")